feat: implemented sampling for MTP #1
Conversation
|
Hi, thanks for this! I think this is similar to, though probably a cleaner version than, what I had before I changed the MTP sampler to a simple argmax. I'll tell you what I think; let me know to what extent this does/does not agree with your understanding:
|
Hi @F1LM1, thanks for the feedback! My bad for missing your earlier commits with a proper sampling implementation; based on the discussion, I thought it was still on your to-do list. After digging into the code, I think I have a clearer picture.
Regarding common_sampler_accept: I see my mistake. In standard speculative mode with a separate draft model (ctx_dft), the accept call inside the draft function maintains the state of the draft context for sequential token generation. But for MTP, since we share a single context and sampler, calling it prematurely would indeed pollute the main sampler's state before verification.
Regarding the "Modify Logits + Greedy" strategy: I'm already drafting an idea for that, which involves filling the sampler's candidate array with the MTP logits, applying the sampling chain, and then taking the first candidate. This is my concept:
// In common/speculative.cpp
llama_token mtp_speculative_gen_draft(
        struct common_sampler * smpl,
        struct llama_context  * ctx,
        llama_token             id_last,
        int32_t                 n_past,
        int32_t                 last_tok_idx) {
    if (!smpl) {
        return -1;
    }

    // single-token batch for the MTP layer
    llama_batch batch = llama_batch_init(1, 0, 1);
    common_batch_add(batch, id_last, n_past, { 0 }, true);
    llama_build_and_execute_mtp_graph(ctx, batch, id_last, n_past, last_tok_idx);

    const llama_model * model = llama_get_model(ctx);
    const llama_vocab * vocab = llama_model_get_vocab(model);
    const int n_vocab = llama_n_vocab(vocab);

    // fill the sampler's candidate array with the MTP logits
    llama_token_data_array * cur_p = common_sampler_get_candidates(smpl);
    const float * logits = llama_get_logits_ith(ctx, last_tok_idx);

    cur_p->size = n_vocab;
    for (int i = 0; i < n_vocab; ++i) {
        cur_p->data[i].id    = i;
        cur_p->data[i].logit = logits[i];
    }
    cur_p->sorted = false;

    // apply the full sampling chain (penalties, DRY, etc.), then take the top candidate
    common_sampler_apply_chain(smpl, cur_p);
    const llama_token id = cur_p->data[0].id;
    ...
// In common/sampling.cpp
void common_sampler_apply_chain(struct common_sampler * gsmpl, struct llama_token_data_array * cur_p) {
    llama_sampler_apply(gsmpl->chain, cur_p);
}

I'll test this out a bit, but in the meantime, I'm open to feedback. I also have another question: I'm looking at how to implement the
|
This seems reasonable to me; with luck it will show up as better acceptance rates when rep penalties are turned on :)
Haven't really started thinking about this. When I have some free time I plan to focus on seeing if we can do some basic optimizations like graph reuse and such. You're definitely welcome to work on this!
Frankly I don't know exactly what the multi-head case for MTP would look like, but my impression is that you cannot MTP-draft N tokens simply by autoregressively predicting with a single MTP head the way you can with a typical draft model. Rather, I believe the number of MTP heads is a fixed feature of the model/weights, so if you wanted to draft say N = 5 tokens at once, the model would have to have at least N = 5 MTP layers/heads that all produce outputs in a single forward pass of the full model (including MTP). I would've guessed that each MTP layer takes as input the previous layer's output embedding and its sampled token (concatenated into an input embedding the way we do for the single MTP head here), rather than having to run the whole thing again autoregressively. But if you find material to the contrary, I would absolutely love to see it.
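To make that concrete, a rough conceptual sketch of how N heads would chain inside one pass (this is not existing llama.cpp API; mtp_multi_head_draft and head_forward are hypothetical placeholders):

#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

using llama_token = int32_t; // stand-in for the real type from llama.h

// Conceptual only: with N MTP heads, each head consumes the previous head's output
// embedding plus the token drafted from it, so all N draft tokens fall out of a
// single forward pass instead of N autoregressive decodes.
static std::vector<llama_token> mtp_multi_head_draft(
        int n_heads,
        std::vector<float> embd,   // output embedding of the main model at the last position
        llama_token tok,           // last accepted token
        // hypothetical per-head forward: (head index, embedding, token) -> (new embedding, drafted token)
        const std::function<std::pair<std::vector<float>, llama_token>(
            int, const std::vector<float> &, llama_token)> & head_forward) {
    std::vector<llama_token> draft;
    draft.reserve(n_heads);
    for (int h = 0; h < n_heads; ++h) {
        auto out = head_forward(h, embd, tok);
        embd = std::move(out.first);
        tok  = out.second;
        draft.push_back(tok);
    }
    return draft;
}
|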
Hey! I tried with 52 requests, ranging from 6k to 38k tokens of context, and got an acceptance rate of ~0.5931 +/- 0.041, compared with the ~0.51 I reported before. This was with the same settings (temp=1.0, DRY enabled) for writing. The latest commit includes these changes.
I was unable to find proper documentation or even much discussion, only suggestions to look at SGLang and vLLM, so I looked at how vLLM implements it. Their approach is as follows:
self.num_mtp_layers = config.num_nextn_predict_layers
self.layers = torch.nn.ModuleDict({
    str(idx): Glm4MoeMultiTokenPredictorLayer(...)
    for idx in range(self.mtp_start_layer_idx,
                     self.mtp_start_layer_idx + self.num_mtp_layers)
})
...
def forward(..., spec_step_idx: int = 0):
    ...
    current_step_idx = (spec_step_idx % self.num_mtp_layers)
    return self.layers[str(self.mtp_start_layer_idx + current_step_idx)](...)

So if we pass, for example, spec_step_idx = 0, 1, 2, ..., the layer index wraps modulo num_mtp_layers: with a single MTP layer the same layer is reused for every draft step, and with more layers it would alternate between them (see the small sketch below). My proposed plan now shifts to two steps:
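As a side note on the selection itself, the llama.cpp equivalent would just be a modular index into the NextN layers; a minimal sketch, where n_layer_main and n_nextn_layers are placeholder names rather than actual hparams fields:

#include <cstdint>

// Placeholder names, not actual llama.cpp hparams fields:
//   n_layer_main   - number of regular transformer layers (the NextN/MTP layers sit after them)
//   n_nextn_layers - how many MTP heads the checkpoint ships (1 for GLM-4.5 today)
//   spec_step_idx  - which draft step we are on (0, 1, 2, ...)
static int32_t mtp_layer_for_step(int32_t n_layer_main, int32_t n_nextn_layers, int32_t spec_step_idx) {
    const int32_t step = spec_step_idx % n_nextn_layers; // wraps, so a single head simply gets reused each step
    return n_layer_main + step;                          // index of the MTP layer to run for this draft step
}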
|
Great, I've been away the last couple of days, but I'll give this a spin as well; sounds promising!
If I'm reading this correctly, it looks like if num_mtp_layers = 1 then it will run the one MTP layer autoregressively, but if num_mtp_layers = 2, for example, it will alternate between the layers? That seems... odd, but I agree it can't hurt to match their implementation until we have an example of a model with num_mtp_layers > 1 to see if it works. Hopefully we'll see decent draft acceptance at least for the "easy" cases (coding), and even if not, it's easy enough to just recommend choosing the N that ends up working best. |
Yes, the alternating layer logic seems odd. I felt the same way, especially since we've only seen models with a single MTP head, and the previous layers don't have the
# in vllm/spec_decode/eagle.py
class EagleProposer:
    ...
    def propose(self, ...):
        ...
        # Generate the remaining draft tokens.
        draft_token_ids_list = [draft_token_ids]
        for _ in range(self.num_speculative_tokens - 1):
            # The input for this iteration is the token generated in the previous one.
            input_ids = draft_token_ids_list[-1].int()
            # Runs the model for a single step
            last_hidden_states, hidden_states = self.model(...)
            # Calculates logits and samples the next token (with argmax)
            logits = self.model.compute_logits(last_hidden_states[:batch_size], None)
            draft_token_ids = logits.argmax(dim=-1)
            # Appends the new token for the next iteration
            draft_token_ids_list.append(draft_token_ids)

This is essentially what our
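Concretely, the analogous loop on our side could look roughly like this; just a sketch, reusing the mtp_speculative_gen_draft signature from earlier in this thread and glossing over how last_tok_idx and the MTP KV cache should evolve between steps:

#include <vector>
#include "speculative.h" // this PR's common/speculative.h, for mtp_speculative_gen_draft

static std::vector<llama_token> mtp_draft_n(
        struct common_sampler * smpl,
        struct llama_context  * ctx,
        llama_token             id_last,
        int32_t                 n_past,
        int32_t                 last_tok_idx,
        int32_t                 n_draft) {
    std::vector<llama_token> draft;
    for (int32_t i = 0; i < n_draft; ++i) {
        // feed the previously drafted token back in, like vLLM's propose() loop
        const llama_token id = mtp_speculative_gen_draft(smpl, ctx, id_last, n_past, last_tok_idx);
        if (id < 0) {
            break;
        }
        draft.push_back(id);
        id_last = id;
        n_past += 1; // simplification: assumes the MTP KV cache advances by one position per step
    }
    return draft;
}

Whether the single GLM-4.5 head produces useful tokens beyond the first draft position is a separate question, but structurally that is all the recursion amounts to. |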
|
As I commented on the other PR, I suspect that supporting KV cache for multi-token MTP drafts is going to be a significant step up in complexity, while for the one-token case we can piggyback on the existing KV cache system (since it sets aside cache for the single MTP layer already). I'll get a chance tomorrow to spin up this PR, but I think this should represent the optimal sampling subroutine. If you're eager to finish it off, maybe start thinking about how we can make the setup more efficient by reusing stuff where possible (memory ctx? graphs? >1 size batches?), since we're basically recreating a bunch of stuff from scratch for every token. |
Okay, I'll take a look at what you suggested and find ways to store the state for the context and graph. Regarding the batch size part, if I understand correctly, you're referring to fixing the alternation between draft and main model tokens in the server's main loop. I agree that would be a great optimization, but it seems like it would take a while, change a lot of the server logic, and require extensive testing. I feel like that would be a good follow-up PR. It's more of a general feature to improve not only MTP but drafting in general, and giving it a separate PR would allow us to merge the core MTP implementation first. |
Nah, I meant some form of batching when we do the MTP layer prompt processing step, since we're likely going to process hundreds of tokens at once using the same graphs/memory context/etc. Right now we're building only size-1 batches, which just feels wrong. AFAIK, the alternation thing might be deceptively easy to fix: I suspect it could be as simple as making sure we only do the non-speculative llama_decode step exactly once, i.e. immediately after prompt processing. I'll need to find my notes on this, but I'm pretty sure everything else is always correctly synced, at least for the MTP case.
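To sketch the batching point (hypothetical: only the single-token llama_build_and_execute_mtp_graph exists in this PR, so the batched entry point below is a placeholder), the MTP warmup over the prompt could be one multi-token batch instead of N size-1 batches:

#include <vector>
#include "common.h" // common_batch_add
#include "llama.h"

// Hypothetical batched MTP warmup: populate the MTP layer's KV cache for the whole
// prompt in one pass, instead of building a size-1 batch per token.
static void mtp_warmup_prompt(struct llama_context * ctx,
                              const std::vector<llama_token> & prompt,
                              llama_pos n_past_start) {
    llama_batch batch = llama_batch_init((int32_t) prompt.size(), 0, 1);
    for (size_t i = 0; i < prompt.size(); ++i) {
        // no logits needed: this pass only warms the MTP KV cache
        common_batch_add(batch, prompt[i], n_past_start + (llama_pos) i, { 0 }, false);
    }
    // llama_build_and_execute_mtp_graph_batched(ctx, batch); // hypothetical batched entry point
    llama_batch_free(batch);
}
|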
Ah, thanks for clarifying the batching part! I was looking at the main generation loop, but batching the MTP prompt processing makes sense. I'll keep that on my list of things to look into.
That's a great point about the alternation fix potentially being simple; I'm looking forward to your notes on that when you find them.
In the meantime, I've been focused on your first suggestion: reusing resources during the single-token draft generation to avoid recreating everything from scratch. I've been experimenting with a
My latest attempt was to build the graph once inside
My diagnosis is that the
It feels like a chicken-and-egg problem: to properly reuse the graph, we'd need to persist the scheduler, but to persist the scheduler, we'd need to persist the memory context, which is the core of the KV cache problem. I'm probably missing something due to my limited knowledge of the
Any pointers would be a huge help. |
|
I finally got a chance to test the improved sampler. It works well in my testing, raising draft acceptance rate in some "hard" writing scenarios by more than 10 percentage points on average, which is a clear and large gain. Ironically it ends up being slower in actual tok/s generation, presumably because the sampling chain as-is is inefficient, but let's see what we can do about that in follow-ups. Re: the graph reuse questions you mentioned above, I'll fire up the project again this weekend and see what I find. It's been a while since I dove in. |
commit 912ed2cd9339d1b2875d98744ca5b51fa62e581e Author: samuel <[email protected]> Date: Sun Dec 7 23:00:29 2025 -0300 speculative (feat): implement recursive MTP drafting for GLM-4.5 commit bdf72d9 Author: samuel <[email protected]> Date: Sat Dec 6 16:10:16 2025 -0300 sampling (feat): optimize speculative drafting with fast-path selection commit a91980a Author: samuel <[email protected]> Date: Sat Dec 6 15:18:19 2025 -0300 mtp (chore): clean old code commit 6de0ecf Author: samuel <[email protected]> Date: Sat Dec 6 14:40:13 2025 -0300 mtp (feat): add mtp arg commit ea77394 Author: samuel <[email protected]> Date: Sat Dec 6 13:47:54 2025 -0300 mtp-graph (fix): move llama_get_logits_ith outside the loop commit 15dff20 Merge: 171346c cae85fe Author: samuel <[email protected]> Date: Thu Oct 16 13:44:41 2025 -0300 Merge branch 'glm4-mtp-batch' of https://github.com/SamuelOliveirads/llama.cpp into glm4-mtp-graph-cache commit cae85fe Author: samuel <[email protected]> Date: Thu Oct 16 13:42:31 2025 -0300 mtp-batch(fix): avoid logits for mtp kv cache operations commit 171346c Author: samuel <[email protected]> Date: Sun Oct 12 16:33:01 2025 -0300 mtp-graph(feat): Reactivate graph reuse only for main model path commit 0127c6b Author: samuel <[email protected]> Date: Sat Oct 11 22:20:54 2025 -0300 mtp-batch(chore): Remove final MTP debug logs and dead code commit 4bcc9e2 Author: samuel <[email protected]> Date: Sat Oct 11 18:51:22 2025 -0300 mtp-batch(fix): Correctly advance cache head and add MTP documentation commit b4cbe03 Author: samuel <[email protected]> Date: Sat Oct 11 18:37:40 2025 -0300 mtp-batch(chore): Fix logit flags for speculative sampling and remove debug logs commit a99709d Author: samuel <[email protected]> Date: Fri Oct 10 17:24:34 2025 -0300 mtp-batch(refactor): Extract decode context and MTP input logic into helper methods commit 913af8f Author: samuel <[email protected]> Date: Fri Oct 10 16:44:28 2025 -0300 mtp-batch(refactor): Replace MTP boolean flags with an explicit operation enum commit 6f74ba3 Author: samuel <[email protected]> Date: Thu Oct 9 22:27:18 2025 -0300 mtp-batch (fix): prevent mtp draft from polluting the cache commit 5e1d719 Author: samuel <[email protected]> Date: Thu Oct 9 15:21:23 2025 -0300 mtp-batch (feat): Create and manage sinfo for MTP commit febd823 Author: samuel <[email protected]> Date: Sun Oct 5 14:43:40 2025 -0300 mtp-batch (wip): fix how to warmup kv cache for MTP commit 67c6c06 Author: samuel <[email protected]> Date: Sat Sep 27 19:42:32 2025 -0300 mtp-batch (wip): Isolate MTP graph to prevent host embedding buffer corruption commit 75dc25e Author: samuel <[email protected]> Date: Sat Sep 27 17:17:00 2025 -0300 mtp-batch (wip): organize batch for mtp cache commit 3da7e7f Author: samuel <[email protected]> Date: Tue Sep 23 22:45:11 2025 -0300 mtp-batch (fix): warm mtp cache for small batch size commit df64508 Author: samuel <[email protected]> Date: Sun Sep 21 21:55:41 2025 -0300 mtp-batch (wip): merge glm graphs commit 042eb8a Author: samuel <[email protected]> Date: Sun Sep 21 21:29:00 2025 -0300 mtp-batch (wip): merge mtp and model graph commit 1318b2d Author: samuel <[email protected]> Date: Sun Sep 14 10:22:59 2025 -0300 mtp-batch (wip): move mtp execution to batch format commit c6237c7 Merge: 9fab53e 8742ce0 Author: Aaron Lee <[email protected]> Date: Sat Sep 13 02:57:01 2025 -0400 Merge pull request F1LM1#1 from SamuelOliveirads/glm4-moe-mtp feat: implemented sampling for MTP commit 8742ce0 Author: samuel <[email 
protected]> Date: Sat Sep 6 00:21:18 2025 -0300 feat: apply logits + greedy sampler commit 5a5bce8 Author: samuel <[email protected]> Date: Wed Sep 3 17:56:14 2025 -0300 fix: add sample acceptance commit 07670a2 Author: samuel <[email protected]> Date: Wed Sep 3 13:25:21 2025 -0300 feat: implemented sampling for MTP commit 9fab53e Author: Aaron Lee <[email protected]> Date: Tue Sep 2 17:14:09 2025 -0400 fixed mtp kv cache update step in cases where prompt size > n_batch and n_ubatch commit 98bc0c6 Author: Aaron Lee <[email protected]> Date: Tue Aug 26 01:26:51 2025 -0400 replace standard sampler with greedy sampler for mtp draft commit 471e026 Author: Aaron Lee <[email protected]> Date: Tue Aug 19 23:10:56 2025 -0400 fixed vram leak commit d72f9d5 Author: Aaron Lee <[email protected]> Date: Tue Aug 19 01:50:34 2025 -0400 kludge-y kv cache management of mtp layer commit 382135a Author: Aaron Lee <[email protected]> Date: Sun Aug 17 21:54:45 2025 -0400 fixed mtp kv cache update sequencing after prompt processing commit 6870f97 Author: Aaron Lee <[email protected]> Date: Sun Aug 17 04:59:36 2025 -0400 added proper KV cache management for MTP layers and slightly refactored commit 6e9bafc Author: Aaron Lee <[email protected]> Date: Fri Aug 15 23:13:56 2025 -0400 failed attempt to implement MTP; outputs tokens but KV cache management is unreasonable commit cf0f7c0 Author: Aaron Lee <[email protected]> Date: Wed Aug 13 02:21:17 2025 -0400 broad thrust of the mtp implementation commit 03231da Author: Aaron Lee <[email protected]> Date: Tue Aug 12 01:03:59 2025 -0400 add model member function to build mtp graph, to be called from speculative.cpp commit 1f477b3 Author: Aaron Lee <[email protected]> Date: Mon Aug 11 20:54:45 2025 -0400 make nextn weights loadable without a crash commit e434f87 Author: Aaron Lee <[email protected]> Date: Mon Aug 11 01:21:47 2025 -0400 some work towards building mtp layer graph commit db60623 Author: Aaron Lee <[email protected]> Date: Sun Aug 10 23:52:54 2025 -0400 added getter for nextn layer count and server slot has_mtp property
ggml-org#958) * port upstream ggml-org#16932 * Add fixed chat templates. * fix grammar when tool have no argument * Insert additional stops for Kimi-K2 * Fix `no triggers set for lazy grammar!` for GLM4.5/4.6 * update chat.cpp * fix grammar for GLM 4.5/4.6 * chat: Fix streaming parser for granite models (ggml-org#15682) * fix(chat): fix streaming parser for granite models * tests: add test cases for Granite models chat parser * common : Fix corrupted memory error on json grammar initialization (ggml-org#16038) Initalizing RESERVED_NAME in is_reserved_name() is not thread safe and leads to corrupted memory when used from multiple threads as can be seen in the asan trace below. This fixes the initialization to make it thread-safe. #0 0x000100abd018 in std::__1::pair<std::__1::__hash_iterator<std::__1::__hash_node<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, void*>*>, bool> std::__1::__hash_table<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::hash<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::equal_to<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>>::__emplace_unique_key_args<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&>(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&) __hash_table:1565 F1LM1#1 0x000100ab0320 in SchemaConverter::visit(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&) json-schema-to-grammar.cpp:802 F1LM1#2 0x000100aafc48 in std::__1::__function::__func<build_grammar(std::__1::function<void (common_grammar_builder const&)> const&, common_grammar_options const&)::$_2, std::__1::allocator<build_grammar(std::__1::function<void (common_grammar_builder const&)> const&, common_grammar_options const&)::$_2>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> (std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&)>::operator()(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, 
nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&) function.h:319 F1LM1#3 0x000100a2c938 in std::__1::__function::__func<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0::operator()(common_grammar_builder const&) const::'lambda'(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&), std::__1::allocator<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0::operator()(common_grammar_builder const&) const::'lambda'(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&)>, void (nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&)>::operator()(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&) function.h:319 F1LM1#4 0x000100a139f8 in foreach_function(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&, std::__1::function<void (nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&)> const&) chat.cpp:762 F1LM1#5 0x000100a2a7f4 in std::__1::__function::__func<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0, std::__1::allocator<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0>, void (common_grammar_builder const&)>::operator()(common_grammar_builder const&) function.h:319 F1LM1#6 0x000100aa98f4 in build_grammar(std::__1::function<void (common_grammar_builder const&)> const&, common_grammar_options const&) json-schema-to-grammar.cpp:982 F1LM1#7 0x0001009c9314 in common_chat_params_init_llama_3_x(minja::chat_template const&, 
templates_params const&, bool) chat.cpp:1110 F1LM1#8 0x0001009b8afc in common_chat_templates_apply_jinja(common_chat_templates const*, common_chat_templates_inputs const&) chat.cpp:1992 ggml-org#9 0x0001009b533c in common_chat_templates_apply(common_chat_templates const*, common_chat_templates_inputs const&) chat.cpp:2074 ggml-org#10 0x000100810120 in llamacpp_apply_chat_template+0x724 (predict_oai-98384e17fb94e863:arm64+0x100090120) ... ==45482==Register values: x[0] = 0x00006020004147f8 x[1] = 0x00006080000013c8 x[2] = 0x0000000000000000 x[3] = 0x0000604006289738 x[4] = 0x0000000000000002 x[5] = 0x0000000000000001 x[6] = 0x04034000004b4000 x[7] = 0x0000000000000001 x[8] = 0xbebebebebebebebe x[9] = 0x17d7d7d7d7d7d7d7 x[10] = 0x00000c04000828ff x[11] = 0x0000000000000001 x[12] = 0x000000002018d383 x[13] = 0x0000000000000000 x[14] = 0xfa0000000000fafa x[15] = 0x000010700001ffff x[16] = 0x000000019dc012c0 x[17] = 0x00000001021284f8 x[18] = 0x0000000000000000 x[19] = 0x00000001700acdc0 x[20] = 0x0000000000000002 x[21] = 0x000000002018d384 x[22] = 0x16dd16fd2e731151 x[23] = 0x0000007000020000 x[24] = 0x0000000100c69c08 x[25] = 0x0000000100c69c20 x[26] = 0x00006080000013c7 x[27] = 0x0000000100c69c00 x[28] = 0x00000001700acd60 fp = 0x00000001700aceb0 lr = 0x0000000100abce30 sp = 0x00000001700acd60 AddressSanitizer can not provide additional info. SUMMARY: AddressSanitizer: SEGV __hash_table:1565 in std::__1::pair<std::__1::__hash_iterator<std::__1::__hash_node<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, void*>*>, bool> std::__1::__hash_table<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::hash<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::equal_to<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>>::__emplace_unique_key_args<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&>(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&) Thread T5 created by T0 here: #0 0x0001020b99d4 in pthread_create+0x5c (libclang_rt.asan_osx_dynamic.dylib:arm64e+0x359d4) F1LM1#1 0x000100873910 in std::sys::pal::unix::thread::Thread::new::h77254fdd87a28e05+0x118 (predict_oai-98384e17fb94e863:arm64+0x1000f3910) F1LM1#2 0x0001007c7a1c in test::run_test::haeb3c2bcd5ed6cf6+0x76c (predict_oai-98384e17fb94e863:arm64+0x100047a1c) F1LM1#3 0x0001007aedb0 in test::console::run_tests_console::he9d142d704f3a986+0x149c (predict_oai-98384e17fb94e863:arm64+0x10002edb0) F1LM1#4 0x0001007c5758 in test::test_main::hf86a5e20735245b9+0x118 (predict_oai-98384e17fb94e863:arm64+0x100045758) F1LM1#5 0x0001007c5da0 in test::test_main_static::h61ee9c8fd30abca0+0x54 (predict_oai-98384e17fb94e863:arm64+0x100045da0) ... 
==45482==ABORTING * common : fix reasoning before forced tool call via tool_choice = required (ggml-org#16264) * common : fix reasoning before forced tool call via tool_choice = required * common : improve reasoning and commentary handling when tool_choice is required (cherry picked from commit c746984) --------- Co-authored-by: Alde Rojas <[email protected]> * Try fix Jinja template for GLM * Improve Kimi-K2 chat template * Fix "Invalid tool call arguments passed" in a rare case. In a rare case, the model may emit a raw string that begins with a valid JSON string. This commit adds unit tests to cover that scenario and fixes the regression introduced during the Kimi-K2 adaptation. --------- Co-authored-by: shun095 <[email protected]> Co-authored-by: David Ribeiro Alves <[email protected]> Co-authored-by: crat0z <[email protected]> Co-authored-by: Alde Rojas <[email protected]>
Hi @F1LM1, I've been following your PR and decided to tackle one of the to-dos you mentioned: implementing proper sampling for the MTP draft model.
I've successfully implemented a solution that retrieves the full logits from the MTP and passes them to the sampler for the draft token generation. The code seems stable and is ready for your review.
Here are the key results from my testing:
Interestingly, when compiled in release mode, I achieved an average acceptance rate of 0.51 for creative tasks, up from the ~0.4 you mentioned in your PR.
I tried to preserve your code, and I'm open to any suggestions for improvement.